Exploiting Dataset Similarity for Distributed Mining
نویسندگان
چکیده
The notion of similarity is an important one in data mining. It can be used to provide useful structural information on data as well as enable clustering. In this paper we present an elegant method for measuring the similarity between homogeneous datasets. The algorithm presented is eÆcient in storage and scale, has the ability to adjust to time constraints. and can provide the user with likely causes of similarity or dis-similarity. One potential application of our similarity measure is in the distributed data mining domain. Using the notion of similarity across databases as a distance metric one can generate clusters of similar datasets. Once similar datasets are clustered, each cluster can be independently mined to generate the appropriate rules for a given cluster. The similarity measure is evaluated on a dataset from the Census Bureau, and synthetic datasets from IBM.
منابع مشابه
Improving Imbalanced data classification accuracy by using Fuzzy Similarity Measure and subtractive clustering
Classification is an one of the important parts of data mining and knowledge discovery. In most cases, the data that is utilized to used to training the clusters is not well distributed. This inappropriate distribution occurs when one class has a large number of samples but while the number of other class samples is naturally inherently low. In general, the methods of solving this kind of prob...
متن کاملImprove The Linear Regression Model in Bioinformatics Using Text Mining
Linear regression is a commonly used approach in bioinformatics. One of the main challenge with applying linear regression in bioinformatics is that the number of regression weights needed to be determined is often at least one order of magnitude larger than the number of data points available for training. This sparse data problem often reduce the reliability in determining regression weights,...
متن کاملDistributed Privacy Preserving Data Mining: A framework for k-anonymity based on feature set partitioning approach of vertically fragmented databases
Recently, many data mining algorithms for discovering and exploiting patterns in data are developed and the amount of data about individuals that is collected and stored continues to rapidly increase. However, databases containing information about individuals may be sensitive and data mining algorithms run on such data sets may violate individual privacy. Also most organizations collect and sh...
متن کاملInstance reduction approach to machine learning and multi-database mining
The paper proposes a heuristic instance reduction algorithm as an approach to machine learning and knowledge discovery in centralized and distributed databases. The proposed algorithm is based on an original method for a selection of reference instances and creates a reduced training dataset. The reduced training set consisting of selected instances can be used as an input for the machine learn...
متن کاملRights Protection of Multidimensional Time-Series Datasets with Neighborhood Preservation
Industry companies frequently outsource datasets to mining firms and academic institutions create repositories and share datasets in the interest of promoting research collaboration. Still, many practitioners feel reserved about about sharing or outsourcing datasets, primarily because of the fear of losing the principal rights over the dataset. This work presents a way of convincingly claiming ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2000